AsterixDB: A Scalable, Open Source BDMS
AsterixDB is a new, full-function BDMS (Big Data Management System) with a
feature set that distinguishes it from other platforms in today's open source
Big Data ecosystem. Its features make it well-suited to applications like web
data warehousing, social data storage and analysis, and other use cases related
to Big Data. AsterixDB has a flexible NoSQL style data model; a query language
that supports a wide range of queries; a scalable runtime; partitioned,
LSM-based data storage and indexing (including B+-tree, R-tree, and text
indexes); support for external as well as natively stored data; a rich set of
built-in types; support for fuzzy, spatial, and temporal types and queries; a
built-in notion of data feeds for ingestion of data; and transaction support
akin to that of a NoSQL store.
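The LSM-based storage mentioned above follows a general pattern: writes land in an in-memory component that is periodically flushed to immutable sorted runs, and reads consult the newest data first. The following is a minimal illustrative sketch of that pattern in Python, not AsterixDB's actual implementation (all class and method names here are hypothetical):

```python
# Illustrative sketch of LSM-style storage: a mutable in-memory "memtable"
# absorbs writes and is flushed to immutable sorted runs; lookups check the
# memtable first, then runs from newest to oldest.

class LSMStore:
    def __init__(self, memtable_limit=4):
        self.memtable = {}          # mutable in-memory component
        self.runs = []              # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # A flush writes the memtable out as a sorted, immutable run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then runs newest-to-oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:        # linear scan; a real run would binary-search
                if k == key:
                    return v
        return None
```

A real LSM tree adds background merging (compaction) of runs and per-run indexes or Bloom filters to keep reads cheap; this sketch shows only the write path and read ordering.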
Development of AsterixDB began in 2009 and led to a mid-2013 initial open
source release. This paper is the first complete description of the resulting
open source AsterixDB system. Covered herein are the system's data model, its
query language, and its software architecture. Also included are a summary of
the current status of the project and a first glimpse into how AsterixDB
performs when compared to alternative technologies (a parallel relational
DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics
platform) on tasks that all of these technologies can perform. The paper
closes with a brief description of some initial trials that the system has
undergone and the lessons learned (and plans laid) based on those early
"customer" engagements.
Progressive Approach To Entity Resolution
Data-driven technologies such as decision support, analysis, and scientific discovery tools have become a critical component of many organizations and businesses. The effectiveness of such technologies, however, is closely tied to the quality of the data on which they are applied. That is why organizations today spend a substantial percentage of their budgets on cleaning tasks, such as removing duplicates, correcting errors, and filling in missing values, to improve data quality before pushing data through the analysis pipeline.

Entity resolution (ER), the process of identifying which entities in a dataset refer to the same real-world object, is a well-known data cleaning challenge. This process, however, is traditionally performed as an offline step before the data is made available for analysis. Such an offline strategy is simply unsuitable for many emerging analytical applications that require low-latency responses (and thus cannot tolerate delays caused by cleaning the entire dataset), and also for situations where the underlying resources are constrained or costly to use.

To overcome these limitations, we study in this thesis a new paradigm for ER: progressive entity resolution. Progressive ER aims to resolve the dataset in a way that maximizes the rate at which data quality improves. This approach can substantially reduce the resolution cost, since the ER process can be terminated early whenever a satisfactory level of quality is achieved.

In this thesis, we explore two aspects of the ER problem and propose a progressive approach to each of them. In particular, we first propose a progressive approach to relational ER, wherein the input dataset consists of multiple entity-sets and the relationships among them. We then propose a parallel approach to entity resolution using the popular MapReduce (MR) framework.
A comprehensive empirical evaluation of the two proposed approaches demonstrates that they achieve high-quality results using limited amounts of resolution cost.
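The core idea of progressive ER described above, resolving the most promising candidate pairs first so that quality rises fastest and the process can stop once a cost budget is spent, can be sketched as follows. This is a hedged illustration, not the thesis's actual algorithm; the prioritization heuristic, `budget` parameter, and `threshold` value are all assumptions introduced for the example:

```python
# Sketch of progressive entity resolution: rank candidate record pairs by a
# cheap similarity proxy, then spend an "expensive" comparison budget on the
# most promising pairs first, so match quality improves fastest early on.

from difflib import SequenceMatcher

def progressive_er(records, budget, threshold=0.85):
    # Enumerate all candidate pairs (a real system would use blocking to
    # avoid the quadratic blow-up).
    pairs = [(i, j) for i in range(len(records))
                    for j in range(i + 1, len(records))]

    # Cheap prioritization: similarity of short prefixes, best first.
    pairs.sort(key=lambda p: -SequenceMatcher(
        None, records[p[0]][:4], records[p[1]][:4]).ratio())

    matches = []
    for i, j in pairs[:budget]:   # stop once the resolution budget is spent
        # "Expensive" full comparison, applied to promising pairs first.
        if SequenceMatcher(None, records[i], records[j]).ratio() >= threshold:
            matches.append((records[i], records[j]))
    return matches
```

Because the budget cap acts as the early-termination point, enlarging `budget` trades more resolution cost for (at most) more matches, which is exactly the quality-versus-cost dial that progressive ER exposes.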